Language Model Tokenizers Introduce Unfairness Between Languages

Neural Information Processing Systems

Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences of up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support.
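
The disparity is easy to observe directly. Below is a minimal sketch using the Hugging Face transformers tokenizer API; the checkpoint name and the example sentences (rough translations of the same greeting) are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch: compare token counts for the same sentence across languages.
# The checkpoint and the translations are illustrative assumptions, not the
# paper's exact setup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer works

samples = {
    "English": "Hello, how are you today?",
    "Spanish": "Hola, ¿cómo estás hoy?",
    "Hindi":   "नमस्ते, आज आप कैसे हैं?",  # rough translation
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{lang:8s} {n_tokens:3d} tokens for {len(text)} characters")
```

With a byte-level BPE vocabulary trained largely on English text, the Devanagari sentence typically falls back to many byte-level tokens, so its token count can be several times the English one; that gap is the kind of disparity the paper quantifies.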


The drones being used in Sudan: 1,000 attacks since April 2023

Al Jazeera

During Sudan's civil war, which erupted in April 2023, both sides have increasingly relied on drones, and civilians have borne the brunt of the carnage. The conflict between the Sudanese armed forces (SAF) and the Rapid Support Forces (RSF) paramilitary group is an example of war transformed by commercially available, easily concealable unmanned aerial vehicles (UAVs), or drones. Modular, well-adapted to sanctions evasion and devastatingly effective, drones have killed scores of civilians, crippled infrastructure and plunged Sudanese cities into darkness. In this visual investigation, Al Jazeera examines the history of drone warfare in Sudan, the types of drones used by the warring sides, how they are sourced, where the attacks have occurred and the human toll. The RSF traces its origins to what at the time was a government-linked militia known as the Janjaweed.


All the countries Israel attacked in 2025: Animated map

Al Jazeera

How many countries has Israel attacked in 2025? Israel has attacked more countries than any other country this year. In 2025, Israel attacked at least six countries, including Palestine, Iran, Lebanon, Qatar, Syria, and Yemen. It also carried out strikes on aid flotillas heading for Gaza in Tunisian, Maltese and Greek territorial waters.


On Global Applicability and Location Transferability of Generative Deep Learning Models for Precipitation Downscaling

Harder, Paula, Lessig, Christian, Chantry, Matthew, Pelletier, Francis, Rolnick, David

arXiv.org Artificial Intelligence

Deep learning offers promising capabilities for the statistical downscaling of climate and weather forecasts, with generative approaches showing particular success in capturing fine-scale precipitation patterns. However, most existing models are region-specific, and their ability to generalize to unseen geographic areas remains largely unexplored. In this study, we evaluate the generalization performance of generative downscaling models across diverse regions. Using a global framework, we employ ERA5 reanalysis data as predictors and IMERG precipitation estimates at $0.1^\circ$ resolution as targets. A hierarchical location-based data split enables a systematic assessment of model performance across 15 regions around the world.
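
As a rough illustration of what a location-based split looks like in practice, the sketch below holds out entire regions so that test locations are never seen during training. The region names, array shapes, and flat (non-hierarchical) structure are hypothetical placeholders; the paper's actual hierarchy and region definitions may differ.

```python
# Sketch of a location-based train/test split: entire regions are held out,
# so generalization is measured on geographic areas unseen during training.
# Region names and array shapes are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: one low-res predictor / high-res target pair per region.
regions = ["amazon", "sahel", "ganges", "alps", "rockies", "outback"]
data = {r: (rng.normal(size=(64, 64)), rng.normal(size=(640, 640)))
        for r in regions}

# Hold out whole regions rather than random samples within regions.
test_regions = {"alps", "outback"}
train = {r: d for r, d in data.items() if r not in test_regions}
test = {r: d for r, d in data.items() if r in test_regions}

print("train regions:", sorted(train), "| test regions:", sorted(test))
```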


From One Attack Domain to Another: Contrastive Transfer Learning with Siamese Networks for APT Detection

Benabderrahmane, Sidahmed, Rahwan, Talal

arXiv.org Artificial Intelligence

Advanced Persistent Threats (APTs) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalance, high-dimensional features, and scarce real-world traces. They often lack transferability: they perform well in the training domain but degrade in novel attack scenarios. We propose a hybrid transfer framework that integrates Transfer Learning, Explainable AI (XAI), contrastive learning, and Siamese networks to improve cross-domain generalization. An attention-based autoencoder supports knowledge transfer across domains, while Shapley Additive exPlanations (SHAP) select stable, informative features to reduce dimensionality and computational cost. A Siamese encoder trained with a contrastive objective aligns source and target representations, increasing anomaly separability and mitigating feature drift. We evaluate on real-world traces from the DARPA Transparent Computing (TC) program, augmented with synthetic attack scenarios to test robustness. Across source-to-target transfers, the approach delivers improved detection scores over classical and deep baselines, demonstrating a scalable, explainable, and transferable solution for APT detection.
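
To make the Siamese/contrastive component concrete, here is a minimal PyTorch sketch of a shared encoder trained with a margin-based contrastive loss, pulling pairs that should align together and pushing mismatched pairs apart. The layer sizes, margin, and random data are illustrative assumptions; the paper's full pipeline additionally involves an attention-based autoencoder and SHAP-selected features upstream.

```python
# Minimal sketch of a Siamese encoder with a margin-based contrastive loss.
# Dimensions, margin, and data are illustrative; the paper additionally uses
# an attention-based autoencoder and SHAP feature selection upstream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x_a, x_b):
        # The same weights encode both branches (weight sharing = "Siamese").
        return self.net(x_a), self.net(x_b)

def contrastive_loss(z_a, z_b, same: torch.Tensor, margin: float = 1.0):
    # same = 1 for pairs that should align (e.g., benign/benign across domains),
    # same = 0 for pairs that should be pushed at least `margin` apart.
    d = F.pairwise_distance(z_a, z_b)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Toy training step on random source/target feature pairs.
enc = SiameseEncoder(in_dim=20)
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
x_src, x_tgt = torch.randn(16, 20), torch.randn(16, 20)
same = torch.randint(0, 2, (16,)).float()

z_a, z_b = enc(x_src, x_tgt)
loss = contrastive_loss(z_a, z_b, same)
opt.zero_grad()
loss.backward()
opt.step()
print(f"contrastive loss: {loss.item():.4f}")
```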


BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

Haque, Farah Binta, Yasin, Md, Saha, Shishir, Rafi, Md Shoaib Akhter, Sadeque, Farig

arXiv.org Artificial Intelligence

Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.
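
A typical way to benchmark a pair from such a dataset with a multilingual transformer is sketched below. The checkpoint name and label mapping are assumptions for illustration (consult the model card when reproducing any benchmark), and the Bengali sentences are rough glosses, not BNLI items.

```python
# Sketch of scoring one premise/hypothesis pair with a multilingual NLI model.
# The checkpoint name and label order are assumptions; consult the model card
# for the actual mapping when reproducing any benchmark.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "joeddav/xlm-roberta-large-xnli"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "আকাশ আজ পরিষ্কার।"      # rough gloss: "The sky is clear today."
hypothesis = "আজ বৃষ্টি হচ্ছে।"     # rough gloss: "It is raining today."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze()

for label_id, label in model.config.id2label.items():
    print(f"{label}: {probs[label_id]:.3f}")
```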


Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens

Nigatu, Hellina Hailu, Mamo, Bethelhem Yemane, Balcha, Bontu Fufa, Tesfaye, Debora Taye, Zewdie, Elbethel Daniel, Nesiru, Ikram Behiru, Hailu, Jitu Ewnetu, Yayo, Senait Mengesha

arXiv.org Artificial Intelligence

As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages (Afan Oromo, Amharic, and Tigrinya), with a focus on gender representation in the datasets. Our findings demonstrate that while the training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender: in the names of persons, in the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and toxic depictions of women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.
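
As a simplified illustration of how such a gender-representation audit can be run, the sketch below counts gendered terms in a corpus sample. The word lists are tiny hypothetical placeholders; a real audit like the paper's needs curated, language-specific lexicons and morphological analysis for grammatical gender.

```python
# Toy sketch of auditing gender skew in a corpus sample by counting gendered
# terms. The word lists are hypothetical placeholders; a real audit requires
# curated, language-specific lexicons and morphological analysis.
import re
from collections import Counter

MALE = {"he", "him", "his", "man", "men"}
FEMALE = {"she", "her", "hers", "woman", "women"}

def gender_counts(sentences):
    counts = Counter()
    for s in sentences:
        for tok in re.findall(r"[a-z']+", s.lower()):
            if tok in MALE:
                counts["male"] += 1
            elif tok in FEMALE:
                counts["female"] += 1
    return counts

corpus = [
    "He said the minister will speak to his supporters.",
    "The doctor finished his shift before the nurse began hers.",
    "She reported the story from the capital.",
]
print(gender_counts(corpus))  # Counter({'male': 3, 'female': 2})
```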